
    Parallel programming paradigms and frameworks in big data era

    With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. We have entered the era of Big Data. The explosion and profusion of available data in a wide range of application domains give rise to new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically of heterogeneous nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges towards truly fourth-generation data-intensive science.
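
    As a rough illustration of the MapReduce paradigm the abstract refers to, the following Python sketch runs the classic word-count example in a single process; the map/shuffle/reduce function names are illustrative, not taken from any particular framework.

    from collections import defaultdict

    # Minimal in-process illustration of the MapReduce paradigm:
    # map emits (key, value) pairs, shuffle groups them by key,
    # reduce aggregates each group.

    def map_phase(document):
        """Emit (word, 1) for every word in the document."""
        return [(word, 1) for word in document.split()]

    def shuffle(pairs):
        """Group emitted values by key."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        """Sum the counts for each word."""
        return {key: sum(values) for key, values in groups.items()}

    documents = ["big data needs parallel processing",
                 "parallel frameworks process big data"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 2, ...}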

    Intelligent services for big data science

    Cities are areas where Big Data is having a real impact. Town planners and administration bodies just need the right tools at their fingertips to consume all the data points that a town or city generates and then be able to turn them into actions that improve people's lives. In this sense, Big Data is a phenomenon with a direct impact on the quality of life for those of us who choose to live in a town or city. Smart Cities of tomorrow will rely not only on sensors within the city infrastructure, but also on a large number of devices that will willingly sense and integrate their data into technological platforms used for introspection into the habits and situations of individuals and city-wide communities. Predictions say that cities will generate over 4.1 terabytes per day per square kilometer of urbanized land area by 2016. Efficiently handling such amounts of data is already a challenge. In this paper we present our solutions designed to support next-generation Big Data applications. We first present CAPIM, a platform designed to automate the process of collecting and aggregating context information on a large scale. It integrates services designed to collect context data (location, the user's profile and characteristics, as well as the environment). We then present a concrete implementation of an Intelligent Transportation System built on top of CAPIM. The application is designed to help users and city officials better understand traffic problems in large cities. Finally, we present a solution for the efficient storage of context data on a large scale. Together, these services support intelligent Smart City applications that actively and autonomously adapt and smartly provision services and content, using the advantages of contextual information.
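
    To make the kind of context data CAPIM aggregates more concrete, here is a hypothetical Python sketch of a single context record (location, user profile, environment); the field names are assumptions for illustration, not the platform's actual schema.

    from dataclasses import dataclass, asdict
    import json
    import time

    # Hypothetical shape of a context record that a CAPIM-like collection
    # service might aggregate; all field names are illustrative assumptions.

    @dataclass
    class ContextRecord:
        device_id: str
        timestamp: float      # Unix epoch seconds
        latitude: float
        longitude: float
        user_profile: dict    # e.g. declared interests, commute habits
        environment: dict     # e.g. ambient sensor readings

        def to_json(self) -> str:
            """Serialize for transport to the aggregation platform."""
            return json.dumps(asdict(self))

    record = ContextRecord(
        device_id="device-42",
        timestamp=time.time(),
        latitude=44.4268,
        longitude=26.1025,
        user_profile={"mode": "commuter"},
        environment={"noise_db": 62.5},
    )
    print(record.to_json())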

    A Simulation Model for Large Scale Distributed Systems

    The use of discrete-event simulators in the design and development of large-scale distributed systems is appealing due to their efficiency and scalability. Their core abstractions of process and event map neatly to the components and interactions of modern-day distributed systems and allow designing realistic simulation scenarios. MONARC 2, a multithreaded, process-oriented simulation framework designed for modelling large-scale distributed systems, allows the realistic simulation of a wide range of distributed system technologies, with respect to their specific components and characteristics. In this paper we present the design characteristics of the simulation model proposed in MONARC 2. We demonstrate that this model includes the necessary components to describe various actual distributed system technologies, and provides the mechanisms to describe concurrent network traffic, evaluate different strategies in data replication, and analyze job scheduling procedures.
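
    The process-and-event abstraction at the heart of such simulators can be captured in a few lines. The following Python sketch implements a generic discrete-event loop over a priority queue; it is a minimal illustration of the concept, not MONARC 2's actual API.

    import heapq

    # Generic discrete-event simulation core: events are (time, callback)
    # pairs processed in timestamp order from a priority queue.

    class Simulator:
        def __init__(self):
            self.clock = 0.0
            self._queue = []   # (event_time, sequence, callback)
            self._seq = 0      # tie-breaker for simultaneous events

        def schedule(self, delay, callback):
            """Schedule a callback to fire `delay` time units from now."""
            heapq.heappush(self._queue, (self.clock + delay, self._seq, callback))
            self._seq += 1

        def run(self, until=float("inf")):
            """Process events in timestamp order until the queue drains."""
            while self._queue and self._queue[0][0] <= until:
                self.clock, _, callback = heapq.heappop(self._queue)
                callback(self)

    def job_arrival(sim):
        print(f"t={sim.clock:.1f}: job arrives, completion scheduled")
        sim.schedule(3.0, lambda s: print(f"t={s.clock:.1f}: job completes"))

    sim = Simulator()
    sim.schedule(1.0, job_arrival)
    sim.run()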

    Total order in opportunistic networks

    Opportunistic network applications are usually assumed to work only with unordered immutable messages, like photos, videos, or music files, while applications that depend on ordered or mutable messages, like chat or shared content editing applications, are ignored. In this paper, we examine how total ordering can be achieved in an opportunistic network. By leveraging existing dissemination and causal order algorithms, we propose a commutative replicated data type algorithm based on Logoot that achieves total order without using tombstones in opportunistic networks where message delivery is not guaranteed by the routing layer. Our algorithm is designed to exploit the nature of the opportunistic network to reduce the metadata size compared to the original Logoot, and in some cases even to achieve higher hit rates than the dissemination algorithms when no order is enforced. Finally, we present the results of experiments with the new algorithm using an opportunistic network emulator, mobility traces, and Wikipedia pages.
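
    The core idea behind Logoot-style total order is that every insertion carries a position identifier that compares the same way on every replica, so concurrent edits converge without coordination. The Python sketch below illustrates this with simplified identifiers; it is a toy illustration of the idea, not the paper's algorithm.

    # Logoot-style total ordering sketch: each element of the shared document
    # carries a position identifier, and identifiers compare lexicographically,
    # so every replica sorts concurrent insertions identically.

    def make_identifier(digits, site_id):
        """A position is a list of (digit, site) pairs; site ids break ties."""
        return [(d, site_id) for d in digits]

    def insert(document, identifier, char):
        """Insertion keeps the document sorted by identifier on every replica."""
        document.append((identifier, char))
        document.sort(key=lambda entry: entry[0])

    doc = []
    insert(doc, make_identifier([5], site_id=1), "a")  # from node 1
    insert(doc, make_identifier([5], site_id=2), "b")  # concurrent, from node 2
    insert(doc, make_identifier([3], site_id=2), "c")
    print("".join(char for _, char in doc))  # "cab" on every replica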

    SPRINT-SELF: social-based routing and selfish node detection in opportunistic networks

    Since mobile devices have become ubiquitous, several types of networks formed over such devices have been proposed. One such approach is opportunistic networking, which is based on a store-carry-and-forward paradigm, where nodes store data and carry it until they reach a suitable node for forwarding. The problem in such networks is how to decide what the next hop will be, since nodes do not have a global view of the network. We propose using a node's social network information when performing routing, since a node is more likely to encounter members of its own social community than other nodes. In addition, we approximate a node's contacts as a Poisson distribution and show that we can predict its future behavior based on the contact history. Furthermore, since opportunistic network nodes may be selfish, we improve our solution by adding a selfish node detection and avoidance mechanism, which can help reduce the number of unnecessary messages sent in the network, and thus avoid congestion and decrease battery consumption. We show that our algorithm outperforms existing solutions such as BUBBLE Rap and Epidemic in terms of delivery cost and hit rate, as well as the rate of congestion introduced in the network, by testing it in various realistic scenarios.
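
    The Poisson-based contact prediction mentioned above can be illustrated with a short calculation: estimate a pairwise encounter rate from the contact history, then compute the probability of at least one encounter within a delivery window. The following Python sketch uses hypothetical numbers and names, not the paper's actual parameters.

    import math

    # Under a Poisson model with rate r, the probability of at least one
    # contact within a horizon t is 1 - exp(-r * t).

    def encounter_rate(contact_times, observation_window):
        """Estimated contacts per hour from the recorded contact history."""
        return len(contact_times) / observation_window

    def prob_contact_within(rate, horizon):
        """P(at least one contact in `horizon` hours) under a Poisson model."""
        return 1.0 - math.exp(-rate * horizon)

    history = [2.0, 5.5, 9.0, 20.0]   # contact timestamps (hours)
    rate = encounter_rate(history, observation_window=24.0)
    print(f"P(contact in next 6h) = {prob_contact_within(rate, 6.0):.2f}")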

    MOMC: Multi-objective and Multi-constrained Scheduling Algorithm of Many Tasks in Hadoop

    (c) 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

    Even though scheduling in distributed systems has been debated for many years, the platforms and the job types are changing every day. This is why we need specialized algorithms based on the requirements of new applications, especially when an application is deployed in a Cloud environment. One of the most important frameworks used for large-scale data processing in Clouds is Hadoop and its extensions. The Hadoop framework comes with default algorithms like FIFO, the Fair Scheduler, the Capacity Scheduler, and Hadoop on Demand. Each of these scheduling algorithms focuses on a different, single constraint; it is hard to satisfy multiple constraints and pursue several objectives at the same time. After summarizing the most common schedulers and showing the need each one addressed when it appeared on the market, this paper presents MOMC, a multi-objective and multi-constrained scheduling algorithm for many tasks in Hadoop. The MOMC implementation focuses on two objectives, avoiding resource contention and achieving an optimal workload across the cluster, and two constraints, deadline and budget. To compare the algorithms based on different metrics, we use the Scheduling Load Simulator, which is integrated into the Hadoop framework and helps developers spend less time on testing. As a killer application that generates many tasks, we chose a processing task for the Million Song Dataset, a dataset containing metadata for one million commercially available songs.
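
    A minimal sketch of what deadline-and-budget constrained scheduling might look like, in the spirit of MOMC: filter out placements that violate either constraint, then pick the least-loaded node. The scoring rule and all names below are assumptions for illustration, not the paper's implementation.

    # Hypothetical multi-constrained task placement: a node is feasible only
    # if the task can finish by its deadline AND within its budget; among
    # feasible nodes, prefer the least loaded to balance the cluster.

    def feasible(task, node, now):
        """Both constraints must hold: deadline and budget."""
        finish_time = now + node["queue_delay"] + task["runtime"]
        cost = task["runtime"] * node["price_per_hour"]
        return finish_time <= task["deadline"] and cost <= task["budget"]

    def schedule(task, nodes, now=0.0):
        """Pick the least-loaded node among those satisfying both constraints."""
        candidates = [n for n in nodes if feasible(task, n, now)]
        if not candidates:
            return None  # task cannot meet its constraints; reject or defer
        return min(candidates, key=lambda n: n["load"])

    nodes = [
        {"name": "n1", "queue_delay": 1.0, "price_per_hour": 0.40, "load": 0.7},
        {"name": "n2", "queue_delay": 3.0, "price_per_hour": 0.10, "load": 0.2},
    ]
    task = {"runtime": 2.0, "deadline": 6.0, "budget": 0.5}
    print(schedule(task, nodes))  # n2: meets deadline and budget, less loaded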
